Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

a specific protein function from a constructed machine learning model [... Schietgat, et al., 2010]. A linear model can provide a good explanation function, but is unable to model complex data. A nonlinear algorithm such as MLP, RBFNN or SVM can model complex data, but lacks the power of explaining what has been done in a model.

Data handling is also a key concern when modelling a data set. Data types can differ greatly from application to application. For instance, they might be categorical or otherwise non-numerical data. The protease cleavage pattern discovery problem always deals with non-numerical data, i.e., the amino acids. Without an encoding process, the aforementioned machine learning algorithms can do nothing for the protease cleavage pattern discovery problem. This poses a challenge for the aforementioned machine learning algorithms.
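As an illustration of what such an encoding process can look like, the following minimal Python sketch maps each amino acid to a 20-bit binary indicator vector (often called one-hot or orthogonal encoding) and concatenates the vectors for a peptide. The function names and the example peptide are assumptions made for this sketch; the text above does not prescribe a particular encoding scheme.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_residue(residue):
    """Return a 20-element 0/1 list with a single 1 marking the residue."""
    if residue not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {residue!r}")
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode_peptide(peptide):
    """Concatenate the one-hot vectors of all residues in a peptide."""
    encoded = []
    for residue in peptide:
        encoded.extend(one_hot_residue(residue))
    return encoded


# Hypothetical eight-residue peptide used only as an example input.
features = encode_peptide("SQNYPIVQ")
print(len(features))  # 8 residues x 20 bits = 160
```

Note that this encoding turns an eight-residue peptide into 160 numerical inputs, so the encoding step itself inflates the dimension of the data that a numerical algorithm has to handle.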

Moreover, model construction complexity has been a concern in some of the aforementioned machine learning algorithms. This is because algorithms such as the MLP require lengthy model construction and a long generalisation test process to overcome the model [...]ty problem.

Finally, dimensionality is also a concern when employing the aforementioned machine learning algorithms. Some of these algorithms are unable to handle data in which the dimension is greater than the number of samples. This is because the statistical significance cannot be [...]ed during a learning process.
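One way to see why such data are troublesome is sketched below in Python, under the assumption of a plain linear model: with three features and only two samples, two different weight vectors reproduce the training targets exactly, yet disagree on a new sample. The data values, weight vectors and function name are made up for this illustration and are not taken from the text.

```python
def predict(weights, sample):
    """Linear prediction: the dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, sample))


# Two samples, three features: the dimension exceeds the sample count.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

# Two different (hand-picked) weight vectors.
w_a = [1.0, 2.0, 0.0]
w_b = [0.0, 1.0, 1.0]

# Both fit the training targets with zero error ...
fit_a = [predict(w_a, row) for row in X]
fit_b = [predict(w_b, row) for row in X]
print(fit_a == y, fit_b == y)

# ... but they give different answers for an unseen sample.
new_sample = [1.0, 0.0, 0.0]
print(predict(w_a, new_sample), predict(w_b, new_sample))
```

Because the training data cannot distinguish between such competing models, a perfect fit carries no evidence that the model found is the right one.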

The working principle of inductive learning

Inductive learning approaches are well recognised for overcoming these limitations of the aforementioned machine learning algorithms. The most commonly used inductive learning approaches include the decision tree algorithm (DT) [Quinlan, 1986] and the classification and regression tree algorithm (CART) [Breiman, et al., 1984]. These algorithms have also been employed in many biological/medical pattern analysis projects.
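To give a feel for how a decision tree can work on categorical residues directly, without a numerical encoding, the following Python sketch computes the information gain of splitting a set of peptides on each sequence position and picks the best one. Information gain is the splitting criterion of Quinlan's ID3 family of decision trees; the peptides, labels and function names here are invented for illustration only, and CART would use a different criterion.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(peptides, labels, position):
    """Entropy reduction obtained by grouping peptides on one residue."""
    groups = {}
    for peptide, label in zip(peptides, labels):
        groups.setdefault(peptide[position], []).append(label)
    remainder = sum(len(group) / len(labels) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


def best_position(peptides, labels):
    """Index of the sequence position with the highest information gain."""
    return max(range(len(peptides[0])),
               key=lambda pos: information_gain(peptides, labels, pos))


# Toy data (made up): the middle residue alone separates the classes.
peptides = ["AFG", "GFC", "ALG", "CLA"]
labels = ["cleaved", "cleaved", "uncleaved", "uncleaved"]

print(best_position(peptides, labels))
```

A full tree would apply the same selection again to each group of peptides produced by the chosen split, stopping when a group contains a single class.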

The basic principle of DT and CART is “divide and conquer”, which is a concept exercised in computer science since the 1970s